1- Data Preparation

The Data

We have daily sales data from BND. The extract covers the sales history from the first quarter of 2016 through the third week of January 2018.

Sales are grouped at the P2 level, where every product is counted separately and promotional sales are included. Sales are grouped by client at the C4 level, where the figures are summed per distributor.

In [20]:
product_raw_df.head()
Out[20]:
Product Client 04/01/2016 05/01/2016 06/01/2016 07/01/2016 08/01/2016 09/01/2016 10/01/2016 11/01/2016 ... 14/01/2018 15/01/2018 16/01/2018 17/01/2018 18/01/2018 19/01/2018 20/01/2018 21/01/2018 22/01/2018 23/01/2018
0 GBA001AUC180FS 68L041 0.00 2.00 0.00 2.00 0.00 3.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1 GBA001AUC180FS 68L042 0.00 1.00 2.00 1.00 0.00 2.00 0.00 1.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 GBA001BND060FS 68C011 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3 GBA001BND060FS 68C120 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 GBA001BND060FS 68C123 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

5 rows × 753 columns

Week Resampling

Since the daily sales of some products fluctuate strongly, we work with weekly sales histories instead.

In [21]:
product_df = product_df.resample('W', axis=1).mean()
product_df.head()
Out[21]:
2016-01-10 00:00:00 2016-01-17 00:00:00 2016-01-24 00:00:00 2016-01-31 00:00:00 2016-02-07 00:00:00 2016-02-14 00:00:00 2016-02-21 00:00:00 2016-02-28 00:00:00 2016-03-06 00:00:00 2016-03-13 00:00:00 ... 2017-11-26 00:00:00 2017-12-03 00:00:00 2017-12-10 00:00:00 2017-12-17 00:00:00 2017-12-24 00:00:00 2017-12-31 00:00:00 2018-01-07 00:00:00 2018-01-14 00:00:00 2018-01-21 00:00:00 2018-01-28 00:00:00
0 1.00 1.14 1.29 1.00 0.71 1.43 1.86 1.57 1.00 1.57 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1 0.86 1.43 1.57 1.57 1.71 1.29 1.71 1.29 1.00 0.57 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 2.29 1.29 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

5 rows × 108 columns

Cleaning Series

A series of cleaning functions is applied to raw data in order to get rid of irrelevant and "dirty" data with patterns that could disrupt the model.

Each function takes a threshold parameter so the filtering can be tuned.

Trimming Zeros from series

Remove the all-zero weeks at both ends of each series.

In [23]:
product_df.head()
The last 8 values (complete zeros) of each series have been dropped 
The first 0 values (complete zeros) of each series have been dropped 
Out[23]:
2016-01-10 00:00:00 2016-01-17 00:00:00 2016-01-24 00:00:00 2016-01-31 00:00:00 2016-02-07 00:00:00 2016-02-14 00:00:00 2016-02-21 00:00:00 2016-02-28 00:00:00 2016-03-06 00:00:00 2016-03-13 00:00:00 ... 2017-10-01 00:00:00 2017-10-08 00:00:00 2017-10-15 00:00:00 2017-10-22 00:00:00 2017-10-29 00:00:00 2017-11-05 00:00:00 2017-11-12 00:00:00 2017-11-19 00:00:00 2017-11-26 00:00:00 2017-12-03 00:00:00
0 1.00 1.14 1.29 1.00 0.71 1.43 1.86 1.57 1.00 1.57 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1 0.86 1.43 1.57 1.57 1.71 1.29 1.71 1.29 1.00 0.57 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 1.14 1.57 1.14 0.86 1.29 1.14 0.29 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 1.57 2.29 2.00 2.14 2.00 2.43 2.86 1.14 2.29 1.29
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

5 rows × 100 columns
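The trimming function itself is not shown in the notebook; a minimal sketch of the idea (hypothetical name `trim_zeros`, series stored one per row, weeks as columns) could look like:

```python
import numpy as np
import pandas as pd

def trim_zeros(df):
    """Drop the leading and trailing columns that are zero for every series."""
    nonzero_cols = np.flatnonzero((df != 0).any(axis=0).values)
    if len(nonzero_cols) == 0:
        return df  # nothing to trim against
    return df.iloc[:, nonzero_cols[0]:nonzero_cols[-1] + 1]
```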

Tailing Zeros: No longer sold

Remove the products that weren't sold during the last 15 weeks.

In [24]:
t = 15
Series with 15 trailing zeros are removed
Removed: 4327 , Remaining: 4319
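The filtering cell only prints its result; a sketch of what such a filter might do (hypothetical name, `t` as in the notebook):

```python
import pandas as pd

def drop_dead_series(df, t=15):
    """Remove series whose last t weeks are all zeros (no longer sold)."""
    dead = (df.iloc[:, -t:] == 0).all(axis=1)
    print("Removed:", int(dead.sum()), ", Remaining:", int((~dead).sum()))
    return df.loc[~dead]
```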

Leading Zeros: Recently launched

Remove the products that weren't sold during the first 15 weeks.

In [25]:
t = 15
Series with more than 15 leading zeros are removed
Removed: 1472 , Remaining: 2847
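Likewise, the leading-zero filter is hidden; a sketch (hypothetical name) that drops recently launched series:

```python
import numpy as np
import pandas as pd

def drop_recent_launches(df, t=15):
    """Remove series with more than t leading zero weeks (recently launched).
    Assumes all-zero series were already removed by the previous filters."""
    first_sale = np.argmax((df != 0).values, axis=1)  # index of first non-zero week
    return df.loc[first_sale <= t]
```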
In [27]:
product_df_clean = product_df
print(product_df.shape)
product_df.head()
(2847, 100)
Out[27]:
2016-01-10 00:00:00 2016-01-17 00:00:00 2016-01-24 00:00:00 2016-01-31 00:00:00 2016-02-07 00:00:00 2016-02-14 00:00:00 2016-02-21 00:00:00 2016-02-28 00:00:00 2016-03-06 00:00:00 2016-03-13 00:00:00 ... 2017-10-01 00:00:00 2017-10-08 00:00:00 2017-10-15 00:00:00 2017-10-22 00:00:00 2017-10-29 00:00:00 2017-11-05 00:00:00 2017-11-12 00:00:00 2017-11-19 00:00:00 2017-11-26 00:00:00 2017-12-03 00:00:00
37 4.71 4.29 4.71 5.14 4.14 4.57 4.71 5.00 4.00 4.57 ... 2.29 1.71 2.14 1.86 2.29 2.14 1.71 3.00 2.57 0.00
39 8.57 13.71 11.86 6.29 6.29 6.14 6.29 5.14 5.29 6.14 ... 7.86 9.71 19.29 17.29 9.00 8.00 7.14 0.00 0.00 0.00
40 3.71 0.86 0.43 0.14 0.29 0.29 1.00 0.43 2.00 4.14 ... 1.29 1.29 1.29 1.29 1.00 1.29 1.00 0.00 0.00 0.00
44 1.14 1.29 0.86 1.43 2.86 1.29 1.57 1.86 1.86 1.71 ... 0.29 0.14 0.29 0.43 0.71 0.29 0.71 0.00 0.00 0.00
45 1.14 0.71 0.86 1.14 1.14 1.14 1.43 0.86 1.14 0.86 ... 0.86 1.29 1.29 1.29 1.00 1.00 0.57 0.00 0.00 0.00

5 rows × 100 columns

Rolling Average

Apply a rolling-average filter with a window of size 4, then remove rare sales.

In [28]:
window = 4
(2847, 97)
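The smoothing cell shows only the window size and the resulting shape; a sketch of the smoothing step (transposing so the rolling window runs along the weeks):

```python
import pandas as pd

def smooth(df, window=4):
    """Rolling mean along the time axis (columns); min_periods avoids NaNs at the start."""
    return df.T.rolling(window=window, min_periods=1).mean().T
```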

Remove Outliers - Winsorizing

Winsorizing sets all outliers to a specified percentile of the data; for example, a 90% winsorization sets all data below the 5th percentile to the 5th percentile and all data above the 95th percentile to the 95th percentile.
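The winsorization code is not shown (the notebook's `t = 4` suggests its exact rule may differ, e.g. a z-score cutoff); a percentile-based sketch matching the description above:

```python
import numpy as np

def winsorize_rows(X, lower=5, upper=95):
    """Clip each series at its own lower/upper percentiles (90% winsorization)."""
    X = np.asarray(X, dtype=float)
    lo = np.percentile(X, lower, axis=1, keepdims=True)
    hi = np.percentile(X, upper, axis=1, keepdims=True)
    return np.clip(X, lo, hi)
```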

In [29]:
t = 4

Mainly Zeros: Rare sales

Remove the products with fewer than 5 weeks of actual sales.

In [30]:
t = 5
Series with fewer than 5 non-zero values are removed
Removed: 1 , Remaining: 2846

Normalize Data

We end up with 2,846 product/client series. We apply Z-normalization (subtract the mean, divide by the standard deviation) to each series to obtain scale-independent data.
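The normalization cell is not shown; a sketch of the Z-normalization applied row-wise:

```python
import numpy as np

def z_normalize(X):
    """Z-normalize each series (row): subtract its mean, divide by its std."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    sigma[sigma == 0] = 1.0  # guard against constant series
    return (X - mu) / sigma
```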

In [33]:
product_df_full.head()
Out[33]:
Product Client 2016-01-24 00:00:00 2016-01-31 00:00:00 2016-02-07 00:00:00 2016-02-14 00:00:00 2016-02-21 00:00:00 2016-02-28 00:00:00 2016-03-06 00:00:00 2016-03-13 00:00:00 ... 2017-09-24 00:00:00 2017-10-01 00:00:00 2017-10-08 00:00:00 2017-10-15 00:00:00 2017-10-22 00:00:00 2017-10-29 00:00:00 2017-11-05 00:00:00 2017-11-12 00:00:00 2017-11-19 00:00:00 2017-11-26 00:00:00
37 GBA001BND060SS 68A139 1.61 1.61 1.61 1.61 1.61 1.61 1.61 1.61 ... -1.00 -1.26 -1.26 -1.26 -1.26 -1.17 -1.26 -0.96 -0.88 -1.26
39 GBA001BND060SS 68C120 0.51 0.30 -0.40 -0.91 -1.01 -1.10 -1.10 -0.55 ... 1.08 0.06 0.99 1.76 1.80 1.71 0.60 -0.98 -1.26 -1.26
40 GBA001BND060SS 68C122 -0.15 -1.04 -1.04 -1.04 -0.96 -0.52 0.48 1.66 ... -0.45 -0.33 -0.22 -0.15 -0.22 -0.22 -0.30 -0.63 -0.89 -1.04
44 GBA001BND060SS 68C126 -0.11 0.65 0.65 0.97 1.16 0.71 0.90 1.03 ... -1.37 -1.43 -1.43 -1.43 -1.43 -1.43 -1.24 -1.43 -1.43 -1.43
45 GBA001BND060SS 68C129 -0.35 -0.35 0.33 1.25 0.79 0.79 0.33 -0.35 ... 0.56 1.25 0.79 1.02 1.25 0.79 -0.35 -1.95 -1.95 -1.95

5 rows × 99 columns
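`X_pca` and `princ_axis` are used below without their construction being shown; a plausible sketch with scikit-learn (stand-in data; the printed vector corresponds to the explained-variance ratios of the first components):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_z = rng.standard_normal((200, 50))   # stand-in for the z-normalized series

pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_z)         # coordinates of each series on the axes
princ_axis = pca.components_           # the principal axes themselves (one per row)
print(pca.explained_variance_ratio_)   # share of variance captured per component
```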

In [34]:
hlp.Clusters_plot(X= X_pca, labels = np.zeros(len(X_pca)))
[ 0.192122    0.13360553  0.0868289   0.07429497  0.04438952]

PCA 01

Shows an increasing trend over time.

In [36]:
plt.plot(range(nb_col), princ_axis[1, :])

PCA 02

Contrasts sales in terms of seasonality: products correlated with this axis tend to sell during autumn or early summer.

In [37]:
plt.plot(range(nb_col), princ_axis[2, :])
In [49]:
corrSamples = hlp.GetMostCorrelatedTo(X_pca,component,index=product_df.index)
Product 1744

3. Modeling - Clustering Algorithms

Try out hierarchical clustering, k-means, and k-medoids on the raw (cleaned) data. Then draw a PCA plot to visualize the clustering result on the principal components.

In [61]:
product_df_full.head()
Out[61]:
Product Client 2016-01-24 00:00:00 2016-01-31 00:00:00 2016-02-07 00:00:00 2016-02-14 00:00:00 2016-02-21 00:00:00 2016-02-28 00:00:00 2016-03-06 00:00:00 2016-03-13 00:00:00 ... 2017-09-24 00:00:00 2017-10-01 00:00:00 2017-10-08 00:00:00 2017-10-15 00:00:00 2017-10-22 00:00:00 2017-10-29 00:00:00 2017-11-05 00:00:00 2017-11-12 00:00:00 2017-11-19 00:00:00 2017-11-26 00:00:00
0 GBA001BND060SS 68A139 1.61 1.61 1.61 1.61 1.61 1.61 1.61 1.61 ... -1.00 -1.26 -1.26 -1.26 -1.26 -1.17 -1.26 -0.96 -0.88 -1.26
1 GBA001BND060SS 68C120 0.51 0.30 -0.40 -0.91 -1.01 -1.10 -1.10 -0.55 ... 1.08 0.06 0.99 1.76 1.80 1.71 0.60 -0.98 -1.26 -1.26
2 GBA001BND060SS 68C122 -0.15 -1.04 -1.04 -1.04 -0.96 -0.52 0.48 1.66 ... -0.45 -0.33 -0.22 -0.15 -0.22 -0.22 -0.30 -0.63 -0.89 -1.04
3 GBA001BND060SS 68C126 -0.11 0.65 0.65 0.97 1.16 0.71 0.90 1.03 ... -1.37 -1.43 -1.43 -1.43 -1.43 -1.43 -1.24 -1.43 -1.43 -1.43
4 GBA001BND060SS 68C129 -0.35 -0.35 0.33 1.25 0.79 0.79 0.33 -0.35 ... 0.56 1.25 0.79 1.02 1.25 0.79 -0.35 -1.95 -1.95 -1.95

5 rows × 99 columns

Evaluation metrics

SSE: the sum of squared errors is a measure of variation around the cluster centers. It is the sum of the squared differences between each point and the center of its cluster.

Silhouette score: assesses whether each point belongs to the "right" cluster: is it close to the points of its own cluster? Is it far from the points of the other clusters?

In [ ]:
def getSSE(samples, labels):
    """Sum of squared distances from each sample to its cluster centroid."""
    sse = 0.0
    for k in np.unique(labels):
        cluster = samples[labels == k]
        sse += np.sum((cluster - cluster.mean(axis=0)) ** 2)
    return sse

def getSilhouette(samples, labels):
    return metrics.silhouette_score(samples, labels)

Agglomerative - (CAH)

The idea is to group together the individuals whose curves are similar (by Euclidean distance or correlation). Distances are computed between products pairwise: the more dissimilar two observations are, the larger the distance. The CAH then merges individuals iteratively to produce a dendrogram, or classification tree.

In [59]:
Z = hierarchy.linkage(X_z, method='complete', metric='euclidean')
hierarchy.dendrogram(Z, truncate_mode='lastp', p=100, leaf_rotation=90., leaf_font_size=7., show_contracted=True)
In [253]:
plot_inertia()

Critère de Ward

Ward's method merges, at each step, the pair of clusters whose fusion minimizes the increase in within-cluster inertia (equivalently, the loss of between-cluster inertia).

In [254]:
n_cluster = 116
ward = AgglomerativeClustering(n_clusters=n_cluster, linkage='ward').fit(X_z)

BIRCH Algorithm

Used to perform hierarchical clustering over particularly large datasets. The advantage of the BIRCH algorithm is its ability to cluster incoming data points incrementally and dynamically.

In [75]:
labels_birch = Birch(n_clusters= n_cluster, threshold=0.5, compute_labels=True).fit_predict(X_z)

Partitionning Algorithms

SOM (Self-Organizing Maps)

Self-organizing maps are built on a grid. Each node of the grid holds a "neuron", and each neuron is tied to a reference vector responsible for a region of the data space (a set of similar curves).

In [67]:
som = sompy.SOMFactory().build(X_z, mapsize, mask=None, mapshape='rectangular', lattice='rect', normalization='var', initialization='pca', neighborhood='gaussian', training='batch', name='sompy')  
Topographic error = 0.153900210822; Quantization error = 6.2435798326

K-Means Algorithm

The main idea is to define k centers, one per cluster, and assign every point in the data set to its nearest center. Once all points are assigned, the k centroids are recomputed as the barycenters of the resulting clusters, and the points are reassigned to the nearest new center. This repeats until the centers no longer move.

K-means: Validate the number of clusters

In [81]:
plot_inertia()
plot_silhouaite()
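`plot_inertia` and `plot_silhouaite` are helper functions whose bodies are not shown; the validation loop behind them presumably looks something like this (stand-in data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # stand-in data

inertias, silhouettes = [], []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                        # elbow criterion (SSE)
    silhouettes.append(silhouette_score(X, km.labels_))
```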

Apply K-means model

In [255]:
kmeans = KMeans(n_clusters=92).fit(X_z)

K-medoids (PAM Algorithm)

The k-medoids algorithm is a clustering algorithm related to k-means. K-means minimizes the total squared error, while k-medoids minimizes the sum of dissimilarities between the points of a cluster and the point designated as its center. In contrast to k-means, k-medoids chooses actual data points as centers (medoids, or exemplars).
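`kMedoids.cluster` used below is a custom helper; a simplified sketch of k-medoids on a precomputed distance matrix (Voronoi iteration, not the full PAM swap search):

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Simplified k-medoids on a precomputed distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                # new medoid = member minimizing total distance within the cluster
                new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1), medoids
```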

K-medoids Validate the number of clusters

Custom distance computation

In [103]:
corr_distance = squareform(pdist(X_z, 'correlation'))  # Pearson correlation distance
euclid_distance = squareform(pdist(X_z, 'euclidean'))
In [139]:
plot_inertia()
plot_silhouaite()
In [195]:
label, medoids_euc = kMedoids.cluster(euclid_distance,k= 113)
In [258]:
df.nlargest(7,"SSE")
Out[258]:
SSE
Agg_complete 521,110.11
Ward 510,172.34
Birch 504,761.40
kMeans 500,011.45
kMedoids_corr 178,083.81
kMedoids 165,892.85

Display Clustering results

In [224]:
disp = hlp.Cluster_series_plot(data_df = product_df_full, cluster_df = eucl_df,headers = row_headers)

Forecast quality on clusters - First tests

In this section we build forecasts at the cluster level using classical Holt-Winters models. The goal is to evaluate the prediction quality on the individual series composing each cluster after performing a weighted split.

The evaluation metric is the MASE (mean absolute scaled error), a scale-independent metric computed relative to the MAE of a naive forecast.
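The MASE implementation is not shown; the standard definition can be sketched as (naive lag-m forecast as the scaling benchmark):

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """MAE of the forecast divided by the in-sample MAE of the naive lag-m forecast."""
    y_train = np.asarray(y_train, dtype=float)
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / naive_mae
```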

In [133]:
group_series.head()
Out[133]:
2016-01-10 00:00:00 2016-01-17 00:00:00 2016-01-24 00:00:00 2016-01-31 00:00:00 2016-02-07 00:00:00 2016-02-14 00:00:00 2016-02-21 00:00:00 2016-02-28 00:00:00 2016-03-06 00:00:00 2016-03-13 00:00:00 ... 2017-10-01 00:00:00 2017-10-08 00:00:00 2017-10-15 00:00:00 2017-10-22 00:00:00 2017-10-29 00:00:00 2017-11-05 00:00:00 2017-11-12 00:00:00 2017-11-19 00:00:00 2017-11-26 00:00:00 2017-12-03 00:00:00
Cluster
1 401 467 422 520 466 219 253 286 283 506 ... 350 422 224 232 179 193 192 107 134 19
2 97 103 66 36 37 36 46 296 379 109 ... 37 130 94 27 15 13 10 10 11 2
3 371 276 119 507 515 148 128 128 101 523 ... 498 466 476 170 122 116 743 183 168 21
4 1076 1185 491 278 251 1624 1366 451 1354 1260 ... 173 139 703 678 219 154 161 125 268 84
5 281 286 256 298 298 338 290 322 324 332 ... 393 311 328 429 311 288 249 143 211 34

5 rows × 100 columns

Building Holt-Winter's Models

In [141]:
forecast = pd.DataFrame(forecasts,columns = group_series.columns[-fc:],index = group_series.index)

Aggregate vs. Split Forecast

In [227]:
display(mase_variation.describe())
Agg Split Forecast Increase Bias
count 113.00 113.00 113.00 113.00 113.00
mean 1.28 1.24 1.87 7.92 -27.62
std 0.69 0.73 1.32 52.28 22.98
min 0.25 0.46 0.71 -61.35 -81.73
25% 0.88 0.84 1.14 -22.37 -39.27
50% 1.09 1.05 1.46 -5.56 -28.47
75% 1.56 1.35 2.06 23.26 -17.14
max 3.60 4.89 8.66 286.80 54.02

More numbers...

  1. The percentage of clusters where forecast quality improves at the split level: we compare the average forecast quality at the split level with the forecast at the cluster's aggregation level.

  2. The percentage of "good" forecasts, i.e. where quality decreases by less than 20%.

  3. The percentage of products where the split forecast beats the product-specific forecast.

In [167]:
display(mase_variation[good_splits].sort_values(by=["Agg"]).head())
Increased Forecast quality: 55.75%
Good Forecast quality: 77.88%
Better than detail Forecast quality: 90.27%
Agg Split Forecast Increase Bias
18 0.25 0.53 1.00 112.72 -46.38
91 0.35 0.57 0.91 64.06 -37.26
99 0.42 0.46 0.71 9.42 -35.83
7 0.42 0.57 1.52 36.52 -62.34
68 0.48 0.74 1.11 54.20 -32.78

Displaying forecasts

In [163]:
cluster = 99
plt.show()
Cluster 99 split:0.46 , agg:0.42 *** Increase 9.42 %